SPLITTING METHODS FOR CONVEX CLUSTERING 

ERIC C. Cfflt AND KENNETH LANGE* 

Abstract. Clustering is a fundamental problem in many scientific applications. Standard 

methods such as fe-means, Gaussian mixture models, and hierarchical clustering, however, are beset 

by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations 

of fc-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique 

global minimizer. In this work we present two splitting methods for solving the convex clustering 

problem. The first is an instance of the alternating direction method of multipliers (ADMM); the 

second is an instance of the alternating minimization algorithm (AMA). In contrast to previously 

C^ considered algorithms, our ADMM and AMA formulations provide simple and unified frameworks 

T~H for solving the convex clustering problem under the previously studied ^i,£2i and E^^ norms and 

C _ ^ open the door to potentially novel norms. We demonstrate the performance of our algorithm on 

fS| both simulated and real data examples. 
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, 1 1. Introduction. In recent years convex relaxations of many fundamental, yet 

^^ combinatorially hard, optimization problems in engineering, applied mathematics, 

^^ and statistics have been introduced [62]. Good, and sometimes nearly optimal solu- 

tions, can be achieved at affordable computational prices for problems that appear 
at first blush to be computationally intractable. In this paper we introduce two new 
algorithmic frameworks based on variable splitting that generalize and extend recent 
efforts to convexify the classic unsupervised problem of clustering. 

Lindsten et al. [40] and Hocking et al. [34] formulate the clustering task as a convex 
Q>^ optimization problem. Given p points Xi, . . . , Xp in M', they suggest minimizing the 

^\ convex criterion 

O 1 ^ 

m 

I where 7 is a positive tuning constant, Wij is a nonnegative weight, and the ith column 

Uj of the matrix U is the cluster center attached to point x^. Lindsten et al. [40] 
consider an £p norm penalty on the differences u^ — Uj while Hocking et al. [34] 
consider £1, £2, and €00 penalties. In the current paper an arbitrary norm defines the 



X 

^ penalty. 

cd 



The objective function bears some similarity to the fused lasso signal approxima- 
tor [59]. In fact, the graphical fused lasso as described in [60] is a special case of the 
problem studied here. In the graphical interpretation of clustering, each point corre- 
sponds to a node in a graph, and an edge connects nodes i and j whenever Wij > 0. 
Figure 1.1 depicts an example. In this case the objective function F^{\5) separates 
over the connected components of the underlying graph. Thus, one can solve for the 
optimal U component by component. Without loss of generality, we assume the graph 
is connected. 
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Fig. 1.1: A graph with positive weights wi2, 1^15, W34 and ah other weights Wij = 0. 






Fig. 1.2: Cluster path assignment: The simulated example shows five well separated 
clusters and the assigned clustering from applying the convex clustering algorithm 
using an ^2-iiorm. The blue lines trace the path of the individual cluster centers as 
the regularization parameter 7 increases. 



When 7 = 0, 
a unique cluster. 



the minimum is attained when u^ = x^, and each point occupies 
As 7 increases, the cluster centers begin to coalesce. Two points 



Xi and Xj with u^ = Uj are said to belong to the same cluster. For sufhciently high 
7 all points coalesce into a single cluster. Because the objective function F^{XJ) in 
equation (1.1) is strictly convex and coercive, it possesses a unique minimum point 
for each value of 7. If we plot the solution matrix U as a function of 7, then we 
can ordinarily identify those values of 7 giving k clusters for any integer k between p 
and 1. In theory k can decrement by more than 1 as certain critical values of 7 are 
passed. Indeed, when points are not well separated, we observe that many centroids 
will coalesce abruptly unless care is taken in choosing the weights Wij. 

The benefits of this formulation are manifold. As we will show, convex relaxation 
admits a simple and fast iterative algorithm that is guaranteed to converge to the 
unique global minimizer. In contrast, the classic fc-means problem has been shown to 
be NP-hard [1, 12]. In addition, the classical greedy algorithm for solving fc-means 
clustering often gets stuck in suboptimal local minima [20, 41, 43]. 

Another vexing issue in clustering is determining the number of clusters. Agglom- 
erative hierarchical clustering [30, 36, 39, 49, 65] finesses the problem by computing 
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an entire clustering path. Agglomerative approaches, however, can be computation- 
ally demanding and also tend to fall into suboptimal local minima since coalescence 
events are not reversed. The alternative convex relaxation considered here performs 
continuous clustering just as the lasso [9, 58] performs continuous variable selection. 
Figure 1.2 shows how the solutions to the alternative convex problem traces out an 
intuitively appealing, globally optimal, and computationally tractable solution path. 

1.1. Contributions. Our main contributions are two new methods for solving 
the convex relaxation. Relatively little work has been published on algorithms for 
solving this optimization problem. Lindsten et al. [40] used an off-the-shelf convex 
solver, CVX [11, 31], to generate solution paths. Hocking et al. [34] introduced three 
distinct algorithms for the three most commonly encountered norms. Given the (.i 
norm and unit weights Wij^ the objective function separates, and they solve the convex 
clustering problem by the exact path following method designed for the fused lasso 
[35]. For the f-i and ^2 norms with arbitrary weights Wij^ they employ subgradient 
descent in conjunction with active sets. Finally, they solved the convex clustering 
problem under the £00 norm by viewing it as minimization of a Frobenius norm over 
a polytope. In this guise the problem succumbs to the Frank- Wolfe algorithm [22] of 
quadratic programming. 

In contrast to this piecemeal approach, we introduce two similar generic frame- 
works for minimizing the convex clustering objective function with an arbitrary norm. 
One approach solves the problem by the alternating direction method of multipliers 
(ADMM), while the other solves it by the alternating minimization algorithm (AM A). 
The key step in both cases requires computing the proximal map of a given norm. 
Consequently, both of our algorithms apply provided the penalty norm admits efficient 
computation of its proximal map. Both algorithms are also amenable to acceleration. 

In addition to introducing new algorithms for solving the convex clustering prob- 
lem, the current paper contributes in other concrete ways: (a) We characterize both 
of the new algorithms theoretically. This entails proving the convergence of their 
iterates and explicitly showing how their computational complexity scales with the 
connectivity of the underlying graph, (b) We provide new proofs of intuitive prop- 
erties of the solution path. These results are tied solely to the minimization of the 
objective function (1.1) and hold regardless of the algorithm used to find the mini- 
mum point, (c) We suggest stopping rules based on suboptimality conditions, (d) We 
provide guidance on how to choose the weights wij. Our suggested choices diminish 
computational complexity and enhance solution quality. 

1.2. Related Work. The literature on clustering is immense; the reader can 
consult the books [29, 32, 38, 48, 67] for a comprehensive review. The clustering func- 
tion (1.1) can be viewed as a convex relaxation of either fc-means clustering [40] or 
hierarchical agglomerative clustering [34] . Both of these classical clustering methods 
[56, 57, 65] come in several varieties. The literature on fc-means clustering reports 
notable improvements in the computation [17] and quality of solutions [4, 8, 38] deliv- 
ered by the standard greedy algorithms. Faster methods for agglomerative hierarchical 
clustering have been developed as well [21]. Many statisticians view the hard cluster 
assignments of fc-means as less desirable than the probabilistic assignments generated 
by mixture models [45, 61]. Mixture models have the advantage of gracefully assign- 
ing points to overlapping clusters. These models are amenable to the EM algorithm 
and can be extended infinite mixtures [18, 53, 50]. 

Alternative approaches to clustering involve identifying components in the as- 
sociated graph via its Laplacian matrix. Spectral clustering [42] can be effective in 
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cases when the clusters are non-convex and linearly inseparable. Although spectral 
clustering is valuable, it is not in conflict with convex relaxation. Indeed, Hocking et 
al [34] argue that convex clustering can be effectively merged with spectral clustering. 
Although we agree with this point, the solution path uncovered by convex clustering 
is meritorious in its own right because it partially obviates the persistent need for 
determining the number of clusters. 

1.3. Notation. Throughout, scalars are denoted by lowercase letters (a), vectors 
by boldface lowercase letters (u), and matrices by boldface capital letters (U). The 
jth column of a matrix U is denoted by Uj. At times in our derivations, it will be 
easier to work with the vectorization of matrices. We adopt the convention of denoting 
the vectorization of a matrix (U) by its lower case letter in boldface (u). Finally, we 
denote sets by upper case letters (B). 

1.4. Organization. The rest of the paper is organized as follows. We first 
characterize the solution path theoretically. Previous papers take intuitive properties 
of the path for granted. We then review the ADMM and AMA algorithms and adapt 
them to solve the convex clustering problem. Once the algorithms are specified, we 
discuss their computational complexity, convergence, and acceleration. The paper 
concludes with a few numerical examples and a general discussion. 

2. Properties of the solution path. We prove that the solution path U(7, w), 
as a function of the regularization parameter 7 and its weights w = {wij}, has several 
nice properties that expedite its numerical computation. 

Proposition 2.1. The solution path 11(7) exists and depends continuously on 
7. The path also depends continuously on the weight matrix w. 

Proof. The existence and uniqueness of U(7) are immediate consequences of 
the coerciveness and strict convexity of F^(U). To prove continuity in 7, consider a 
subinterval [a, b] and fix an arbitrary point U. For any 7 e [a, b] the inequalities 

i^a[U(7)]<f^[U(7)] < F^(U) < F,{IJ) 

hold because F^{XJ) is non-decreasing in 7 and 11(7) minimizes Fj{XJ). Since Fa{XJ) is 
coercive, U(7) is bounded for all 7 G [a, b] . Suppose that U(7) fails to be continuous at 
7. Then there is an e > and a sequence 7^ tending to 7 such that ||U(7fc) — U(7)|| > e 
for aU k. Because the U(7fc) are bounded, we can pass to a subsequence if necessary 
and assume that U(7fc) converges to a point U. Because -^^(U) is continuous in (7, U), 
taking limits in the inequality Fj^[U{'jk)] < ^7fc(U) shows that -^^(U) < P-yi^) for 
arbitrary U. Since U(7) is uniquely determined, this implies that U = U(7), which 
obviously contradicts the assumption that ||U(7fe) — U(7)|| > e for all k. The proof 
of the continuity of U in w proceeds along similar lines but is notationally more 
cumbersome. Details are left to the reader. D 

Existence and uniqueness of U sets the stage for a well-posed optimization prob- 
lem. Continuity of U suggests employing homotopy continuation. Indeed, empirically 
we find great time savings in solving a sequence of problems over a grid of 7 values 
when we use the solution of a previous value of 7 as a warm start or initial value for 
the next larger 7 value. 

We next prove that the centroids eventually coalesce to a common point. For the 
example shown in Figure 1.1, we intuitively expect for sufficiently large 7 that the 
columns of U satisfy U3 = U4 = X34 and Ui = U2 = U5 = X125, where X34 is the mean 
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of X3 and X4 and X125 is the mean of xi, X2, and X5. The next proposition confirms 
our intuition. 

Proposition 2.2. Suppose each point corresponds to a node in a graph with an 
edge between nodes i and j whenever Wij > 0. // this graph is connected, then F^{\J) 
is minimized by X for 7 sufficiently large, where each column of X equals the average 
X of the n vectors x.; . 

Proof. A point X furnishes a global minimum of the convex function ^^(X) if 
and only if all forward directional derivatives d&F^pC) at X are nonnegative. At the 
average matrix X a brief calculation demonstrates that 

p 
dei^^(X)=^(x-x„0,)+7^u-,,|10,-0,|l. (2.1) 

i—l i<j 

To construct a lower bound on this quantity, note that 

p 
^(x-x„0,)=O 

for all j. It follows that 

^^(x-x„0,)=O. 
3=1 i=i 

The generalized Cauchy-Schwartz inequality therefore implies 

p p 



f -Iff 

^(X - Xj, 0,) = - X! X!^^ ^ Xj, 0i) 
i=\ '^ j = l 1=1 






^--Eii^-^'iitii^^-^j 



p ■ 

for the norm || • ||| dual to || • ||. In view of expression (2.1), it suffices to prove that 7 
can be taken so large that 

7^] m.,|10.- 0,11 >-^|lx-x,|ltll0,;- 0,11 (2.2) 

i<j i<j 

for all 0. 

When all Wij > 0, one can take any 7 that exceeds 

2 ||x-x,|lt 
- max . 

P ij Wtj 

In general set 

2 

P = -. max|lx-x.,|U. 

p mm^.^:^o Wij i 
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For any pair i and j there exists a path i^k^---^l^j along which the weights 
are positive. It follows that 



\e,-9j\\<\\9,-ek\\ + --- + \\ei-e^ 



ji 



and that 



Hence, 



-||x-x,||t||0,-0,|| < py wkiWOk - ei 

^ k<l 



^ i<j ^ ^ fe<; 

and any 7 > (2)/? satisfies the forward directional derivative test. D 

We close this section by noting that in general the clustering paths are not guar- 
anteed to be agglomerative. In the special case of the £1 norm with uniform weights 
Wij = 1, Hocking et al [34] prove that the path is agglomerative. In the same pa- 
per they give an £2 norm example where the centroids fuse and then unfuse as the 
regularization parameter increases. 

3. Algorithms to Compute the Clustering Path. Having characterized 
the solution path U(7), we now tackle the task of computing it. We present two 
closely related optimization approaches: the alternating direction method of multi- 
pliers (ADMM) [6, 24, 25] and the alternating minimization algorithm (AMA) [63]. 
Both approaches employ variable splitting to handle the shrinkage penalties in the 
convex clustering criterion (1.1). 

3.1. Reformulation of Convex Clustering. Let us first recast the convex 
clustering problem as the equivalent constrained problem 

1 ^ 
minimize - ^ ||x, - uj^ + ^^ u;;||v,|| 

^ j=i I i-^-lJ 

subject to u/j — u/2 — V; = 0. 

Here we index a centroid pair by / = ih^h) with li < I2 and introduce a new variable 
V; — ujj — ui^ to account for the difference between the two centroids. The purpose 
of variable splitting is to simplify the penalties. The Lagrangian 

1 ^ 

i=l I I 

for problem (3.1) leads to the dual function 

D^(A)= inf £o(U,V,A) 
' u,v 

1 ^ (3-2) 

= -2^||A,||^_^(A,,x,,-x,J-^^c,(A/), 



1=1 
where 
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The set Ci indexing the convex indicator function Sci in formula (3.2) equals the dual 
ball {A; : ||A;|j| < •ywi}. We will derive the dual function in a moment. 

Splitting methods such as ADMM and AMA have been successfully used to attack 
similar problems in image restoration [28]. ADMM and AMA are now motivated as 
variants of the augmented Lagrangian method (ALM) [33, 51, 52, 54]. Let us review 
how ALM approaches the constrained optimization problem 

minimize /(u) + g(v) 
subject to Au + Bv = c, 

which includes the constrained minimization problem (3.1) as a special case. ALM 
solves the equivalent problem 

minimize f(u) + n(-v) -\ — jlc — Au — Bvjlo, 
subject to Au + Bv = c 

by imposing a quadratic penalty on deviations from the feasible set. The two problems 
(3.3) and (3.4) are equivalent because their objective functions coincide for any point 
(u, v) satisfying the equality constraint. The Lagrangian for the ALM problem 

C,{u, V, A) = /(u) + ,g(v) + (A, c - Au - Bv) + ^||c - Au - Bvjj^ 

invokes the dual variable A as a vector of Lagrange multipliers. If /(u) and ^(v) are 
convex and A and B have full column rank, then the objective (3.4) is strongly convex, 
and the dual problem reduces to the unconstrained maximization of a concave function 
with Lipschitz continuous gradient. The dual problem is therefore a candidate for 
gradient ascent. In fact, this is the strategy that ALM takes in the updates 

(u"+\ v"+i) = argmin /:^(u, V, A™) 

(3.5) 
A"+i = A™ + u{c - Au'"+i - Bv'"+i). 

Unfortunately, the minimization of the augmented Lagrangian over u and v 
jointly is often difhcult. ADMM and AMA adopt different strategies in simplify- 
ing the minimization subproblem in the ALM updates (3.5). ADMM minimizes the 
augmented Lagrangian one block of variables at a time. This yields the algorithm 

u'"+i ^ argmin /:^(u, v", A") 

U 

v™+i = argmin C(u™+\v, A") (3.6) 

V 

yn+i = A™ + i/(c - Au™+i - Bv"+i). 

AMA takes a slightly different tack and updates the first block u without augmenta- 
tion, assuming /(u) is strongly convex. This change is accomplished by setting the 
positive tuning constant v to be 0. The overall algorithm iterates according to 

u™+i = arg min Co{u, v", A") 

U 

v™+i = argmin /:,(u™+\ v. A") (3.7) 

V 

yn+l ^ ^m + j,(c - AU™+1 - Bv'"+1). 
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Although block descent appears to complicate matters, it often markedly simplifies 
optimization in the end. In the case of convex clustering, the updates are either simple 
linear transformations or proximal maps. 

Before moving on to discuss specific incarnations of the splitting algorithms and 
their implementation via proximal maps, it is helpful to recall how one calculates the 
dual to problem (3.3). The Lagrangian of the problem is 

£o(u, V, A) - /(u) + .g(v) + (A, c - Au - Bv) . 

This translates into the dual function 

V{X) = inf £o(u,v,A) 

U,V 

= inf {/(u)-(A*A,u)}+inf {.g(v) - (B*A, v)} + (A, c> 

U V 

--r(A*A)-,g*(B*A) + (A,c) 
= -r(A'A)-.9(A), 

involving Fenchel conjugates. If /(u) is strongly convex, then f*{A*\) has Lipschitz 
continuous gradient [37]. If 5(v) is a convex and lower semicontinuous function, 
then the proximal map of the partial dual 5(A) — g*(B*A) — (A, c) is well defined. 
Consequently, the dual problem is a prime candidate for solution by the proximal 
gradient algorithm [10] 

A"+i = prox,g[A'" - i/AVr(A*A")] (3.8) 

with sufficiently small step size i'. Despite appearances, the updates (3.7) are actually 
equivalent to the proximal gradient update (3.8). Proofs of this fact can be found in 
the papers [27, 63]. We present an alternative proof in Appendix A. 

Calculation of the dual problem for convex clustering fits within this framework. 
The key ingredient is the Fenchel conjugate pair 

/(u) = ^||x-u||2 and r(z) = i||z||^ + (z,x). 

In the convex clustering problem B — — Iqe, where q is the number of features and 
s is the number of edges in the incidence matrix generated by the positive weights 
Wij. A little thought shows that A can be expressed via Kronecker products and the 
standard basis of R'' as A* = (A^ • • • A* ) and A* = [e;^ — e;^] ig) Ig. In this notation 

r(A*A) = i||A*A||^ + (A*A,x) 
I I 

-i— 1 li—i ^2— i ^ 

The passage from the third expression to the fourth expression on the right of this 



Splitting Methods for Convex Clustering 



string of equalities is justified by the observation 



^ [e;^ «) A; - ei^ (g) Xi] 



\l2h=p^i - Y.i2=p^i-, 



and the appropriate grouping of sums of squares. The partial dual 5(A) — 5* (A) in 
the clustering context is 



5r*(A) = sup 






sup 



= ^SUp (A;,V;) -7W/||v;|| 
, V, L J 



where Ci — {z : ||z||| < ^wi}. 

3.2. Proximal Map. For u > the function 



Prox^n(u) =argmin 



ar!(v) + -||u-v||2 



is a well-studied operation called the proximal map of the function r2(v) . The proximal 
map exists and is unique whenever the function r2(v) is convex and lower semicon- 
tinuous. Norms satisfy these conditions, and for many norms of interest the proximal 
map can be evaluated by either an explicit formula or an efficient algorithm. Table 3.1 
lists some common examples. The proximal maps for the ii and £2 norms have ex- 
plicit solutions and can be computed in 0{q) operations for a vector v £ K'?. The 
proximal map for the ^00 norm requires projection onto the unit simplex and lacks 
an explicit solution. However, there are good algorithms for projecting onto the unit 
simplex [15, 46]. In particular, Duchi et al.'s projection algorithm makes it possible 
to evaluate prox^iui (v) in O{q\ogq) operations. The ^1,2 norm 



gee 



^Sl|2 



listed in Table 3.1 partitions the components of v into non-overlapping groups Q. 
In this case there is a simple shrinkage formula. If the groups overlap, then Mairal 
et al. [44] have shown that the associated proximal map can computed exactly in 
polynomial time by solving a quadratic min-cost flow problem. 

3.3. ADMM updates. The augmented Lagrangian is given by 



j=i I 



(3.9) 



2 

2ll2- 
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Table 3.1: Proximal maps for common norms. 



Norm 




rj(v) prox^s^(v) 


Comment 


il 




l|v||i 


1- rS 


+ 


Element-wise soft-thresholding 


ii 




I|V||2 


[l II '^1 

l|v||2 


V 

+ 


Block-wise soft-thresholding 


^oo 




l|v||oo v-7'„5(v) 


S is the unit simplex 


4,2 


E 


gGsll^slls 


1 '^ 


+"^ 


CJ is a partition of {1, ... ,p} 


^ I|V,||2 



The gradient of £,y(U, V, A) with respect to u, is 

— /:^(U,V,A) = u,, -X, - ^ A; + ^ A, -j/^(v, -u., + u,J 

l\^i l^^i l\^i 

To update U, we set g^£i,(U, V, A) — and rearrange. This gives the equation 

p 

where 

y,: = X,; + ^ [A; + v\i\ - X! t-^' + ^"^'l- 

When we sum the equalities Tp-£j^(U, V, A) = on z and exploit symmetric cancel- 
lations, we deduce the identity X]f=i "« = Y^i=\ ^i- Equation (3.10) therefore implies 
that the columns u^ of the optimal matrix U satisfy 

1 pv - 



1 + pv 1 + pv 

To update the V, we first observe that the Lagrangian £i,(U, V,A) is separable 
in the vectors v; . A particular difference vector v; is determined by the proximal map 

V, =argmin-||v-(u,, - u,, -v-^\i)\\l + ^||v|| 

V ^ I-' (^-11) 

where u; = ^wi. Finally, the Lagrange multipliers are updated by 

A; = A; + v{-Vi -U/i +U;J. 

When wi — 0, the updates for A; and v; simplify to 

V; = Uij - u,2 and A; = 0. 
Algorithm 1 summarizes the updates. 
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Algorithm 1 ADMM 



Initialize A° and V°. 

for m = 1, 2,3, ... do 
for i ~ 1, . . . ,p do 



2 
3 
4 
5 
6 
7; 
8 
9 
10 



rn— 1 I ,,,,rn— li \-^ i\m—l _, ,._.rM— 1] 



end for 



for all / do 

Ar = Ar'+Kvr-u;7+u™) 

end for 
end for 



3.3.1. Stopping Criterion for ADMM. To track the progress of ADMM we 
rely on computing the primal and dual residuals [6]. The necessary and sufficient 
conditions for (u*,v*, A*) to be an optimal solution to (3.3) are primal feasibility 

Au*+Bv*=c, (3.12) 

and dual feasibility 

0e9/(u*)+A*A* (3.13) 

Oeag(v*)+B*A*. (3.14) 

After the mth round of updates (3.6), v™ and A™ satisfy condition (3.14). This 
suggests relying on conditions (3.12) and (3.13) to track the progress of ADMM. 
Inspection of the updates (3.6) reveals that u™+^ satisfies 

e a/(u"+i) + A*A"+i + j/A*B(v'" - v^+i), 

or after rearrangement 

t/A*B(v'"+i - V™) e a/(u"+i) + A*A"+\ 

The quantity on the left is referred to as the dual residual since it quantifies the 
deviation of the iterates from dual feasibility. In the convex clustering problem, there 
are p dual residual vectors 



^™+i ^ ^ ^[vr+^ - vn - Et^r^' - ^" 



The primal residuals are given by 






Let S™ and R™ denote the matrices with ith columns s™ and r™ respectively. If we 
wish to base our stopping on a measure of suboptimality, then following Frobenius 
norm inequality 



In,, -1^112 



X-U"||^+7}^™d|vri|-F7(U*)<||A'"||F||R"||F + ||U"-U*||F||S'"|JF 
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comes in handy. The derivation of this bound can be found in appendix A of [6] . Of 
course, the term on the right cannot be computed since U* is unknown. We will show 
later, however, that (U™, V™) converges to (U*,V*). Therefore, S™ converges to 
and ||U™ — U*||f < ||X — X||f for sufficiently large m. Thus, we can terminate when 

|!A™||f||R"||f + ||X-X||f||S™||f<t, 

for some tolerance r > 0. This rule is engineered to track the suboptimality but does 
not gives rigorous guarantees, unlike the one we will see later for AMA. 

3.4. AMA updates. Since AMA shares its update rules for V and A with 
ADMM, consider updating U. Recall that AMA updates U by minimizing the ordi- 
nary Lagrangian (v — case), namely 

U"+i = argmin ^ ^ ||x, - u^g + ^(A^, v, - u,, + u,,). 



In contrast to ADMM, this minimization separates in each u^ and gives the much 
simpler update 



ur+^ = x, + ^Ar-^AF 



Further scrutiny of the updates for V and A reveals additional simplifications. 
Moreau's decomposition [10] 

z = proXf,,(z) +tproXt-i^,(t"iz). 

allows one to express the proximal map of a function h in terms of the proximal map 
of its Fenchel conjugate h* . This decomposition generalizes the familiar orthogonal 
projection decomposition, namely z — Vwi'^^+Vw^ (z) where Ty is a closed Euclidean 
subspace and W^ is its orthogonal complement. If /i(z) = ||z|| is a norm, then 
/i*(z) = 6b{z) is the convex indicator function of the unit ball B — {y : ||y|j-f < 1} of 
the dual norm. Because the proximal map of the indicator function of a closed convex 
set collapses to projection onto the set, Moreau's decomposition leads to the identity 

proxj,,(z) = z- tproxt-i5^(i"^z) = z - <7'_B(t"^z) = z -Vtsi^), (3.15) 

where Vb{'^) denotes projection onto B. In this derivation the identity I^^Sb = 5b 
holds because 5b takes only the values and oo. Applying the projection formula 
(3.15) to the V; update (3.11) yields the revised update 



\ U 



for the constant t = ai/v ^ ^wi/v. 
The update for A; is given by 

A™+i = Af' + J^(v™+i - uj^'+i + u^+i). 

Substituting for the above alternative expression for v™+^ leads to substantial can- 
cellations and the revised formula 

■ym-l-l _ ,,-T-, r„m+l „™+l ,,-l\™l 
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Algorithm 2 AMA 



Initialize A . 
1: for TO = 1, 2, 3, ... do 



for i ~ 1 , . . . , p do 

end for 
for all / do 

end for 
end for 



The identities — PtB(z) = Vtsi^'^') and aVtsiz) — 7'atB(flz) for a > furtlier simplify 
the update to 

I =i^cA\ ~v^i ), 

where g[" = u™ — uj" and C; = {A; : || Ai|j| < 7W/}. These updates are not surprising 
since, as noted earlier, the AMA algorithm is performing proximal gradient ascent on 
a dual problem. Algorithm 2 summarizes the AMA algorithm. 

3.4.1. Stopping Criterion for AMA. The explicit functional forms (3.1) and 
(3.2) of the primal and dual functions make it trivial to evaluate the duality gap for 
feasible variables, since they depend on the quantities A^ and g; = u™ — u™, which 
are computed in the process of making the AMA updates. We can stop the AMA 
algorithm when the duality gap is small relative to the the optimal objective value. 
While we can define stopping rules that guarantee solutions within a tolerance on the 
suboptimality, in practice, we stop when 

f,(U-)-J,(A") 
l + i[F^(U™) + Z?^(A'")] 

for T > small. The average |[i^^(U'") + £'^(A'")] serves as an estimate of F^(U*). 

4. Convergence. Both ADMM and AMA converge under reasonable condi- 
tions. 

4.1. AMA. Tseng [63] provides sufficient conditions to ensure the convergence of 
AMA. In the following list of assumptions, the functions /(u) and (/(v) and parameters 
A, B, and c refer to problem (3.3). 

Assumption 4.1 (Assumptions B and C in [63]). 

(a) /(u) and g(v) are convex lower-semicontinuous functions. 

(h) /(u) is strongly convex with modulus a > 0. 

(c) Problem (3.3) is feasible. 

(d) The function g{v) + ||Bv||2 has a minimum. 

(e) The dual of (3.3) has an optimal Lagrange multiplier corresponding to the 
constraint Au + Bv = c. 

It is straightforward to verify that the functions and parameters in problem (3.3) 
satisfy Assumption 4.1. In particular, the strong convexity modulus a = 1 for the 
convex clustering problem. 

Proposition 4.2 (Proposition 2 in [63]). Under Assumption 4.1 the iterates 
generated by the AMA updates (3.7) satisfy the following: 
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(a) lim„^oo u™ = u* , 

(h) lim„^ooBv™ = c-Au*, 

(c) lim„j^oo A" = A* , 
provided that v < 2a/p{A A), where p(A A) denotes the largest eigenvalue of A A. 

The parameter i/ controlling the gradient step must be strictly less than twice the 
Lipschitz constant l/p(A*A). To gain insight into how to choose v, let e < (2) denote 
the number of edges. Then A = $ (g) I, where # is the e x p oriented edge-vertex 
incidence matrix 

{1 If node V is the head of edge / 
— 1 If node V is the tail of edge I 
otherwise. 

As noted earlier, A* A = L(8)I, where L = $*$ is the Laplacian matrix of the associ- 
ated graph. It is well known that the eigenvalues of Z(J§I coincide with the eigenvalues 
of Z. See for example Theorem 6 in Chapter 9 of [47]. Therefore, p(A*A) ~ /o(L). In 
general p(L) < p [3] with equality when the graph is fully connected, namely when 
Wij > for all i < j. Choosing a fixed step size of z^ < 2/p certainly works in prac- 
tice but may be too conservative. We adopt backtracking to achieve better overall 
performance. Details on our backtracking scheme are discussed in Section §7.3. 

4.2. ADMM. Convergence of the ADMM algorithm has been proved under very 
modest assumptions, which we now restate. 

Proposition 4.3. If the functions /(x) and g{x) are closed, proper, and convex, 
and the unaugmented Lagrangian has a saddle point, then the ADMM iterates satisfy 

lim r" = 

m—^oo 

lim [/(U™)-fg(V™)]=p* 
lim A^^A*, 

where r™ = c — Au™ — Bv™ denotes the primal residuals and p* denotes the minimal 
objective value of the primal problem. 

Proofs of the above result can be found in the references [6, 16, 23]. Since the 
convex clustering criterion ^^(U) defined by equation (1.1) is strictly convex and 
coercive, we have the stronger result that the ADMM iterate sequence converges to 
the unique global minimizer U* of F-y (U) . 

Proposition 4.4. The iterates U™ in Algorithm 1 converge to the unique global 
minimizer U* of the clustering criterion F^ (U) . 

Proof. The conditions required by Proposition 4.3 are obviously met by -F^(U). 
In particular, the unaugmented Lagrangian possesses a saddle point since the primal 
problem has a global minimizer. To validate the conjectured limit, we first argue 
that the iterates (U™,V™) are bounded. If on the contrary some subsequence is 
unbounded, then passing to the limit along this subsequence contradicts the limit 

lim i/^(U™,V")=F^(U*) (4.1) 



guaranteed by Proposition 4.3 for the continuous function 

p 
2 



^7(U,V) = i^l|x,-U,;||2+^^^,||v, 



/ 
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Algorithm 3 Fast AMA 



Initialize A ^ = A , ao = 1 


1 


for m = 0,1,2,... do 


2 


for i ~ I, . . . ,p do 


3 


Ar = Ez,=.Ar^-E.,=.Ai 


4 


end for 


5 


for all / do 


6 


Sr - X,, - X,, + AT, AT, 


7 


\T^PcXK''-'^sr) 


8 


end for 


9 


a™ = (l + ^l + 4<_i)/2 


10 


A-'+i-r+^-MA'" A'""'] 


11 


end for 



To prove convergence of the sequence (U™, V™), it therefore suffices to check that 
every limit point coincides with the minimum point of F^(U). Let (U™"jV™") 
be a subsequence with limit (U,V). According to Proposition 4.3, the differences 
u™ — u™ — V™ tend to 0. Thus, the limit (U, V) is feasible. Furthermore, 

lim i/^(U"", V'"") = HJIJ,Y) = F^(U). 

n— foo 

This limit contradicts the limit (4.1) unless F^(U) = F~f{XJ*). Because U* uniquely 
minimizes F-y(U), it follows that U = U*. D 

5. Acceleration. Both AMA and ADMM admit acceleration at little additional 
computational cost. Given that AMA is a proximal gradient algorithm, Goldstein et 
al. [27] show that it can be effectively accelerated via Nesterov's method [5]. Al- 
gorithm 3 conveys the accelerated AMA method. Goldstein et al. [27] also present 
methods for accelerating ADMM. Algorithm 4 summarizes an accelerated version of 
ADMM found to work well in practice. As in Nesterov's method, updates are ex- 
trapolated. However, acceleration is restarted whenever the larger of the primal and 
dual residuals increases. Thus, this ADMM formulation forces the largest residual to 
decrease monotonically. 

6. Computational Complexity. 

6.1. AMA. Suppose we wish to compute for a given 7 a solution such that 
the duality gap is at most r. We start by tallying the computational burden for a 
single round of AMA updates. Inspection of Algorithm 2 shows that computing all 
Ai requires q{2e — p) total additions and subtractions. Computing all vectors g; in 
Algorithm 2 takes 0{eq) operations, and taking the subsequent gradient step also 
costs 0{£q) operations. Computing the needed projections costs 0{eq) operations for 
the £1 and (.2 norms and 0{eq\ogq) operations for the ^00 norm. Finally computing 
the duality gap costs 0{pq -\- eq) operations. The assumption that p is 0{e) entails 
smaller costs. A single iteration with gap checking then costs just 0{eq) operations 
for the (.1 and ^2 norms and 0{eq\ogq) operations for the ^00 norm. 

Estimation of the number of iterations until convergence for proximal gradient 
descent and its Nesterov variant complete our analysis. The pq x eq matrix A is 
typically short and fat. Consequently, the function /*(A*A) is not strongly convex. 
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Algorithm 4 Fast ADMM 



Initialize V = V°, A = A", oq = 1 
1; for ?Ti = 1, 2, 3, . . . do 
for i = 1, . . . ,p do 

end for 



2 
3 

4 
5 
6 
7; 

8: 

9: 

10: 

11: 

12: 

13: 

14: 
15: 
16: 
17: 



for all / do 

vr = prox,,||.||/.K-u™-i^-iAn 

~xT = xr + Kvr - K + K) 

end for 

if max(||s™-i||, ||r "'-i||) > niax(||s"||, ||r"^||) then 
"m+i = (1 + v/1 + 4a2 J/2 

A™+i=r + ^[A"-A™"'] 
else 

1 -.^"1+1 f>''" \ni+l x™ 

a,„+i==l,V ^ =V ,A ^ =A . 
end if 
end for 



Therefore, the best known convergence bounds for the proximal gradient method 
and its accelerated variant are sublinear [5]. Specifically we have the following non- 
asymptotic bounds on the convergence of the objective values: 

i.,(r)-i.,(A™)<^^i^M^^ 

for the unaccelerated proximal gradient ascent and 



D,{X*)-D,{Xn< ,^,,, 



2p(A *A)||A*~ A°| 
(to + 1)' 



for its Nesterov accelerated alternative. Thus taking into account operations per iter- 
ation, we see that the unaccelerated version and acceleration algorithms respectively 
require a computational effort of 0{^) and 0{^) respectively for the £i and £2 

norms to attain a duality gap less than r. These bounds are respectively 0( ^'^ °^^ ) 
and 0(£9_£I«2) for the ^00 norm. Total storage is 0{qe + qp). In the worst case e 

is (2) . However, if we limit a node's connectivity to its k nearest neighbors, then 
e is 0{kp). Thus, the computational complexity of the problem in the worst case 
is quadratic in the number of points p and linear under the restriction to fc-nearest 
neighbors connectivity. The storage is quadratic in p in the worst case and linear in 
p under the A:-nearest neighbors restriction. Thus, limiting a point's connectivity to 
its A:-nearest neighbors renders both the storage requirements and operation counts 
linear in the problem size, namely 0{kqp). 

6.2. ADMM. By nearly identical arguments, the complexity of a single round of 
ADMM updates with primal and dual residual calculation requires 0{£q) operations 
for the £1 and £2 norms and 0{eq\ogq) operations for the £00 norm. Likewise storage 
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requirements are 0{qe + qp). Thus, ADMM and AMA share the same complexity 
per update round and the same global storage requirements. Unfortunately, the con- 
vergence rate of the current ADMM algorithm is unknown, rendering it impossible 
to give overall complexity bounds analogous to those for AMA. We note, however, 
that a closely related alternating linearization method and its accelerated variant [26] 
show sublinear iteration complexity similar to AMA and accelerated AMA. 

7. Practical Implementation. This section addresses practical issues of algo- 
rithm implementation. 

7.1. Choosing weights. The choice of the weight can dramatically affect the 
quality of the clustering path. We set the value of the weight between the zth and 
jth points to be Wij = i'f^ -, exp(— 0|jxi — XjlH), where l't- i is 1 if j is among i's 
/c-nearest-neighbors or vice versa and otherwise. The second factor is a Gaussian 
kernel that slows the coalescence of distant points. The constant (j> is nonnegative; 
the value = corresponds to uniform weights. As noted earlier, limiting positive 
weights to nearest neighbors improves both computational efficiency and clustering 
quality. Although the two factors defining the weights act similarly, their combination 
increases the sensitivity of the clustering path to the local density of the data. 

7.2. Making cluster assignments. For both ADMM and AMA cluster assign- 
ments can be performed easily once we estimate V. The columns v; of V that are 
or nearly so create an adjacency list of nodes belonging to the same cluster. We 
apply breadth first search on this adjacency list to identify the clusters. 

7.3. Implementing backtracking in AMA. Backtracking is a useful adjunct 
to the AMA algorithm. Recall that AMA performs proximal gradient ascent to max- 
imize the dual function D^{\) defined by equation (3.2). Let h{\) denote the smooth 
part of D^{X) ignoring the indicator functions. The gradient of h{\) can be computed 
blockwise as 

Va,/i(A) = x,^ - x,, + A,^ - A,, 



A potential update A"*" is calculated by taking a forward step of length v in the 
direction Vft-(A) and then projecting the result backward onto the corresponding 
constraint sets. These two steps are expressed blockwise by 

Wc then check to see if the local quadratic approximation to h{X) increases. This 
amounts to checking whether the inequality 



MA+) > MA"-^) + E(^^ - Ar \ Va,Ma'"-^)) - ^11 a+ - a 



™-l|,2 



holds. If it does, then we accept A"*" as our update. Otherwise, we decrease v and try 
again. This rule ensures convergence in both the regular proximal gradient algorithm 
as well as its Nesterov accelerated version [5] . Algorithm 5 shows our AMA approach 
with both backtracking and Nesterov acceleration. 
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Algorithm 5 Accelerated AMA with backtracking 



Initialize i/ > and A°. Take 77 > 1. Set ai = 1 and s° = A°. 

> Compute gradient 



1 


for TO = 1,2,3,... do 


2 


for i — 1, . . . ,p do 


3 


A "I V^ \m— 1 v^ \m — l 


4 


end for 


5 


for all / do 


6 


gr-x,,-x,, + A™-A;^ 


7 


end for 


8 


loop 


9 


for all I do 


10 


A+ = Pc,(Ar"'-^gr) 


11 


end for 


12 


if/(A+)</(A'"-i)+Ez(A^-' 


13 


V -i— zz/ry 


14 


else 


15 


break 


16 


end if 


17 


end loop 


18 


y- ^ A+ 


19 


l + ^l+4a^ 


20 


^"^y'" + (^)(y'"-y'""') 


21 


end for 



i> Backtrack 



1 m— 1||2 



then 



[> Take extrapolated step 



8. Numerical Experiments. We now report numerical experiments on convex 
clustering for a synthetic data set and three real data sets. In particular we focus on 
how the choice of the weights Wij affects the quality of the clustering solution. Prior 
research on this question is limited. Both Lindsten et al. and Hocking et al. suggest 
weights derived from Gaussian kernels and /c-nearest neighbors. Because Hocking et 
al. try only Gaussian kernels, in this section we follow up their untested suggestion of 
combining Gaussian kernels and A;-nearest neighbors. 

We also compare the run times of our splitting methods to the run times of 
the subgradient algorithm employed by Hocking et al. for (.1 and I2 paths. They 
provide R and CH — h code for their algorithms. Our algorithms are implemented in 
R and Fortran. A direct comparison of our code and theirs is impossible since we 
use a different lower-level language and perhaps more importantly different stopping 
criteria. Although we appear to achieve comparable and sometimes more accurate 
solutions in less time, it is worth pointing out that generating the most accurate 
minima to problem (1.1) is somewhat irrelevant. Two approximate solutions with the 
same clusters should be judged equivalent. Our primary purpose is to document that 
the splitting strategy is computationally competitive with existing approaches. 

8.1. Qualitative Comparisons. Our next few examples demonstrate how the 
character of the solution paths can vary drastically with the choice of weights Wij . 

8.1.1. Two Half Moons. Consider the standard simulated data of two inter- 
locking half moons in M^ composed of 100 points each. Figure 8.1 shows four convex 
clustering paths computed assuming two different numbers of nearest neighbors (10 
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I 



Fig. 8.1: HalfiTLOons Example: The first and second rows show resuhs using A: = 10 
and 50 nearest neighbors respectively. The first and second columns show results 
using (j) = and 0.5 respectively. 



and 50) and two different kernel constants 4> (0 and 0.5). The upper right panel makes 
it evident that limiting the number of nearest neighbors (fc = 10) and imposing non- 
uniform Gaussian kernel weights {(f> = 0.5) produce the best clustering path. Using 
too many neighbors and assuming uniform weights results in little agglomerative clus- 
tering until late in the clustering path (lower left panel). The two intermediate cases 
diverge in interesting ways. The hardest set of points to cluster are the points in the 
upper half moon's right tip and the lower half moon's left tip. Limiting the number 
of nearest neighbors and omitting the Gaussian kernel (upper left panel) correctly 
agglomerates the easier points, but waffles on the harder points, agglomerating them 
only at the very end when all points coalesce at the grand mean. Conversely, using 
many neighbors and the Gaussian kernel (lower right panel) leads to a clustering path 
that does not hedge but incorrectly assigns the harder points. 

8.1.2. Fisher's Iris Data. Fisher's Iris data [19] consists of four measurements 
on 150 samples of iris flowers. There are three species present: setosa, versicolor, and 
virginica. Figure 8.2a shows the resulting clustering paths under two different choices 
of weights. On the left Wij = 1 for all i < j, and on the right we used 5-nearest 
neighbors and = 4. Since there are four variables, to visualize results we project 
the data and the fitted clustering paths onto the first two principal components of the 
data. Again we see that more sensible clustering is observed when we choose weights 
to be sensitive to the local data density. We even get some separation between the 
overlapping species virginica and versicolor. 

8.1.3. Senate Voting. We consider senate voting in 2001 on a subset of 15 
issues selected by Americans for Democratic Action [13, 2]. The data is binary. We 
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limited our study to the 29 senators with unique voting records. The issues ranged 
over a wide spectrum: domestic, foreign, economic, mihtary, environmental and social 
concerns. The final group of senators included 15 Democrats, 13 Republicans, and 
1 Independent. Figure 8.2b shows the resulting clustering paths under two different 
choices of weights. On the left Wij = 1 for all i < j, and on the right we used 15- 
nearest neighbors and = 0.5. As observed previously, better clustering is observed 
when we choose the weights to be sensitive to the local data density. In particular, 
we get clear party separation. Note that we have an outlying Democrat in Zel Miller 
and that the clustering seen agrees well with what PGA exposes. 

8.1.4. Dentition of mammals. Finally, we consider the problem of clustering 
mammals based on their dentition [13, 32]. Eight different kinds of teeth are tallied up 
for each mammal: the number of top incisors, bottom incisors, top canines, bottom 
canines, top premolars, bottom premolars, top molars, and bottom molars. Again we 
removed observations with teeth distributions that were not unique, leaving us with 
27 mammals. Figure 8.2c shows the resulting clustering paths under two different 
choices of weights. On the left Wij = 1 for all i < j, and on the right we used 5-nearest 
neighbors and (j) — 0.5. Once again, weights sensitive to the local density give superior 
results. In contrast to the iris and senate data, the cluster path gives a different and 
perhaps more sensible solution than projections onto the first two components PCA. 
For example, the brown bat is considered more similar to the house bat and red bat, 
even though it is closer in the first two PCA coordinates to the coyote and oppossum. 

8.2. Computational Comparisons. We compared the run times between the 
subgradient descent algorithm of Hocking et al., ADMM, and AMA. We ran all al- 
gorithms, including the regular and fast versions of ADMM and AMA, on the four 
data sets described above. For all tests we assigned weights using full connectivity, 
t|^ ■> = 1 and (j) — —2. The parameter (j> was chosen to ensure that the smallest 
weight was bounded safely away from zero. We used the default stopping criterion 
for the Hocking algorithm, namely that the £2 norm of the gradient is no more than 
0.001. For our AMA algorithm we used t = 10~^. Both the regular and fast versions 
of AMA employed backtracking. For our ADMM algorithm we used r = 5 x 10~^ 
and v — 1. We used the same sequence of 7 for all algorithms. The experiments were 
performed on an iMac computer with a 3.4 GHz Intel Gore 17 processor and 8 GB of 
RAM. To ensure that all algorithms ran to comparable degree of accuracy, we plot- 
ted the relative difference in the primal objective function between the subgradient 
descent solution and our algorithm's solutions, namely 



where U*^ is the estimated solution by subgradient descent and U° * is the estimated 
solution by one of our alternating methods. Negative values indicate better solutions 
by the subgradient descent, while positive values indicate better solutions by method 

X. 

Figure 8.3a and Figure 8.3b show that the relative difference over the sequence 
of 7 used. We see that all algorithms produced solutions of comparable quality. 
Table 8.1 and Table 8.2, show the run times averaged over 10 trials. AMA appears to 
be superior in our study. We see that acceleration for AMA and ADMM does not pay 
dividends on the smaller data sets. Senate and Mammals. While ADMM is the slower 
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(a) Iris Data: Panel on the right (Set B) used k = 5 nearest neighbors and (/> = 4. 





(b) Senate: Panel on the right (Set B) used A: = 15 nearest neighbors and <j> = 0.5. 








(c) Mammal Data: Panel on the right (Set B) used fc = 5 nearest neighbors and = 0.5. 



Fig. 8.2: Clustering path under the £2 norm. All panels on the left (Set A) used 



Wi. 



1 for all i < j. 
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(b) Relative diflfercncc in primal objective between Subgradient Descent and ADMM/AMA 
(li norm penalty) 

Subgradient AMA ADMM 

Data Descent Regular Fast Regular Fast 



Half moons 


42.01 


1.56 


1.03 


7.81 


4.88 


Iris 


5.97 


1.84 


1.04 


13.59 


9.06 


Senate 


0.20 


0.06 


0.07 


4.36 


4.22 


Mammals 


0.21 


0.10 


0.08 


6.05 


6.21 



Table 8.1: Timing comparison under the £2 norm. Mean run times (sec) over 10 trials 
are reported. 



of the two, ADMM proves its worth in the half moons example where the number of 
data points is larger. 

9. Conclusion & Future Work. In this paper we introduce two splitting algo- 
rithms for solving the convex clustering problem. The splitting perspective encourages 
path following, one of the chief benefits of convex clustering. The splitting perspective 
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Subgradient AMA ADMM 

Data Descent Regular Fast Regular Fast 



Half moons 


58.31 


1.25 


0.90 


4.44 


3.12 


Iris 


12.11 


1.18 


0.78 


8.00 


5.28 


Senate 


0.30 


0.06 


0.05 


2.52 


2.58 


Mammals 


0.42 


0.08 


0.06 


5.01 


5.07 



Table 8.2: Timing comparison under the £i norm. Mean run times (sec) over 10 trials 
are reported. 



also permits centroid penalties to invoke an arbitrary norm. The only requirement is 
that the proximal map for the norm be readily computable. Equivalently, projection 
onto the unit ball of the dual norm should be straightforward. Because proximal 
maps and projection operators are generally well understood, it is possible for us to 
quantify the computational complexity and convergence properties of our algorithms. 

The accelerated AMA method appears to be the faster algorithm. On the other 
hand, ADMM is simpler to implement since the choice of the parameter i' is fixed. 
Given that alternative variants of ADMM may close the performance gap [14, 26], 
we are reluctant to dismiss ADMM too quickly. In any case, both algorithms provide 
viable alternatives to existing approaches and deserve further investigation. For in- 
stance, in both ADMM and AMA, updates of A and V could be parallelized. Hocking 
et al. also employed an active set approach to reduce computations as the centroids co- 
alesce. A similar strategy could be adopted in our framework, but it incurs additional 
overhead as checks for unfusing events have to be introduced. 

The storage demands and computational complexity of convex clustering depend 
quadratically on the number of edges of the associated weight graph in the worst case. 
Limiting a point's connections to its fc-nearest neighbors, for example, ensures that 
the number of edges in the graph is linear in the number of nodes in the graph. Elim- 
inating long-range dependencies is often desirable anyway. Choosing sparse weights 
can improve both cluster quality and computational efficiency. Moreover, finding 
the exact /c-nearest neighbors is likely not essential, and we conjecture the quality of 
solutions would not suffer greatly if approximate nearest neighbors are used and al- 
gorithms for fast computation of approximately nearest neighbors are leveraged [55] . 
On very large problems, the best strategy might be to exploit the continuity of solu- 
tion paths in the weights. This suggests starting with even sparser graphs than the 
desired one and generating a sequence of solutions to increasingly dense problems. A 
solution with fewer edges can serve as a warm start for the next problem with more 
edges. 

The splitting perspective also invite extensions that impose structured sparsity 
on the centroids. Witten and Tibshirani [66] discuss how sparse centroids can improve 
the quality of a solution, especially when only a relatively few features of data drive 
clustering. Structured sparsity can be accomplished by adding a sparsity inducing 
norm penalty to the U updates. The update for the centroids for both AMA and 
ADMM then rely on another proximal map of a gradient step. Introducing a spar- 
sifying norm, however, raises the additional complication of choosing the amount of 
penalization. 

More general problems than clustering can be solved via the proximal gradient 
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method in conjunction with sphtting. For instance, minimizing the criterion 
minimize /(U) + 7 / i(Jij||ui — Uj| 

yields to the proximal gradient method whenever the function /(U) is convex with 
a Lipschitz continuous gradient. Finally, except for a few hints about weights, our 
analysis leaves the topic of optimal clustering untouched. Recently Luxburg [64] 
has suggested some principled approaches to assessing the quality of a clustering 
assignment via data perturbation and resampling. These clues are worthy of further 
investigation. 

Appendix A. AMA performs proximal gradient ascent on the dual. 

In demonstrating the equivalence of the proximal gradient updates and the AMA 
updates (3.7), we build on the findings and notation of Section §3.1. Our point of 
departure is the observation u = V/*(A*A) if and only if u minimizes /(u)— (u, A* A). 
Thus, we can write the proximal step in stages 

u = argmin [/(u) - (Au,A)] 

u 

A — proXj,g(A — t/Au). 

We next show how the computation of the proximal map A — proXj^=(/x) can be 
decomposed into two stages. This tactic involves solving the dual of the optimization 
problem associated with the proximal map and then converting the dual solution to 
the solution of the original proximal map. In the current setting the proximal map 
returns the global solution to the problem 

1 



maximize 

A 



(A,c)-g*(B'A)--||A-M||^ 



The above maximization is dual to the equality constrained minimization over (v, z) 

minimize ^(v) + (/x,z) + -||z||| 

v,z Z 

subject to Bv + z = c. 

Strong duality holds by Slater's constraint qualification since (v, z) = (0, c) is a 
feasible point [7]. Thus, at a solution (A, v), the primal and dual objectives satisfy 



c, A) - 5*(B*A) ^ \\X-^\\l= .g(v) + (//, c - Bv) + -||c - Bvg. 



— \\\-n\\l= .9(v) + (//, c - Bv) + ^ 

Rearranging terms we get 

= -(c. A) + g*(B*A) + ^l|A - /x||^ + gM + (/x,c - Bv) + ^||c - Bv||i 

But the Fenchel- Young inequality says that g{v) + g*{'B^\) > (Bv, A). Combining 
this fact with the above equality gives the inequality, 

(/x,c-Bv) + ^||c-Bv||2. 





o>-( 


c,A) + (Bv,A) + i- 


|A 


-/^l 


2 
2 


The 


right hand 


side simplifies to 












0>||A-/x||2-2t.( 


c - 


Bv 


A 






= A — /i, — v{c — 


Bv)||i 





fi)+i^^\\c-Bv\ 



2 
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Since the right hand side must always be non-negative, we arrive at the equaUty 

X = fi + z/(c — Bv). 

This relationship tells us how to translate between the primal and dual solutions. 
Thus, we compute proXj,g(/x) in two steps. 



V — argmm 



ly , 



5(v) + (/x,c-Bv) + -||c-B 



— tsv 



2 



2' 
prox^g(/x) = ^ + z/(c - Bv). 

Taking fi = X z/Au, we can compute the entire proximal gradient update in three 
steps as 

u^ = argmin /(u) — (Au, A) 

u L 

v^ = argmin ^(v) + (A — i^Au"^,c — Bv) H — ||c — Bv|| 

A+ = A - i^Au+ + i/(c - Bv+) 
= A + i^[c-Au+-Bv+]. 

Note that the first and second of these updates are equivalent to 

u"+i = argmin £o(u,v'". A) 

U 

v'^+i = argmin £^(u'"+\v, A). 



[9 
[lo; 

[11 

[12; 
[is; 
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